Abstract

In a population with haploid reproduction any individual has a single parent in the previous
generation. If all genealogical distances among pairs of individuals (generations from the closest
common ancestor) are known, it is possible to reconstruct their genealogical tree exactly.
Unfortunately, in most cases genealogical distances are unknown and only genetic distances are
available. The genetic distance between two individuals can be measured from differences in mtDNA
(mitochondrial DNA) in the case of humans or other complex organisms, while an analogous distance
can be defined for languages, where it is measured from lexical differences. Assuming a constant
rate of mutation, these genetic distances are random and proportional only on average to
genealogical ones. The reconstruction of the genealogical tree from the available genetic distances
is therefore necessarily imprecise. In this paper we try to quantify the error one may commit in the
reconstruction of the tree for different degrees of randomness. The errors may concern both the
topology of the tree (the branching hierarchy) and, in case of correct topology, the proportions of
the tree (the lengths of the various branches).

PACS:

05.40.-a - Fluctuation phenomena, random processes, noise, and Brownian motion,

87.23.Ge - Dynamics of social systems,

87.23.Kg - Dynamics of evolution,

89.75.Hc - Networks and genealogical trees.
1. INTRODUCTION



Haploid reproduction implies that any individual has a single parent in the previous
generation. Since some of the individuals may have the same parent, the number of ancestors
of the present population decreases going backwards in time until a complete coalescence to
a single ancestor. Therefore, it is possible to construct a genealogical tree whose
branching events connect all the individuals living at the present time to the single
founder ancestor. The genealogical distance between two individuals is simply the time from
their last common ancestor, and it assumes its maximal value only when the common
ancestor coincides with the founder.

In the limit of infinite population size most of the quantities remain random; this is the
case, for example, of the probability density of genealogical distances in a single population.
In fact, even in the thermodynamic limit, this quantity varies for different populations or, at
different times, for the same population. The discovery of this non-self-averaging behavior
is due to the pioneering work of Derrida, Bessis, Jung-Muller and Peliti [3].

Nevertheless, if all genealogical distances among pairs of individuals are known, it is
possible to reconstruct their genealogical tree exactly. Unfortunately, in practice, genealogical
distances are unknown unless one has the relevant historical records, which is not the case
for populations of living organisms and for most linguistic groups (Latin languages are
an exception). In most cases, only genetic distances are available. These distances, in the
case of humans or other complex organisms, can be measured from the differences in mtDNA
(mitochondrial DNA), which is inherited only from the mother and therefore undergoes
haploid reproduction [9, 10]. In the case of languages, instead, they are deduced
from lexical differences [12, 13, 14]. Assuming a constant rate of mutation, these genetic
distances are random and they are proportional only on average to genealogical distances.
The reconstruction of the genealogical tree from the available genetic distances is therefore
necessarily imprecise. In this paper we try to quantify the error one may commit in the
reconstruction of the tree for different degrees of randomness. The errors may concern both the
topology of the tree (the branching hierarchy) and, in case of correct topology, the proportions
of the tree (the lengths of the various branches). The paper is organized as follows: section 2 is
devoted to the deterministic process associated with the genealogical distances. We also show
there how to reconstruct the genealogical tree exactly from them. In section 3 we define
and discuss the random process associated with the genetic distances. In section 4 we introduce
a measure of topological distance between two trees and we quantify how topologically
wrong the tree reconstructed from genetic distances is. In section 5 we quantify the error
concerning the lengths of the various branches of topologically correct trees. Finally, section 6
contains conclusions and outlook. The paper is completed by an appendix to which we have
moved some lengthy calculations.

2. DYNAMICS OF GENEALOGICAL DISTANCES 
AND TREES RECONSTRUCTION 

We consider a very general model of a population of constant size $N$ whose generations
do not overlap in time: any generation is replaced by a new one and any individual has
a single parent. The stochastic rule which assigns the number of offspring to any individual
can be chosen in many ways. In fact, for large population size, results do not depend on the
details of this rule; the only requirement is that the probability of having the same parent
for two individuals must be of order $1/N$ for large $N$. Here we choose the Wright-Fisher
rule: any individual in the new generation chooses one parent at random in the previous
one, independently of the choices of the others.

The genealogical distance between two given individuals is the number of generations
from the closest common ancestor. For large $N$ distances are proportional to $N$, so it is
useful to rescale them by dividing by $N$.

So let us define $d(\alpha,\beta)$ as the rescaled genealogical distance between individuals
$\alpha$ and $\beta$ in a population of size $N$. For two distinct individuals $\alpha$ and
$\beta$ in the same generation one has

$$d(\alpha,\beta) = d(g(\alpha), g(\beta)) + \frac{1}{N} \qquad (1)$$

where $g(\alpha)$ and $g(\beta)$ are the parents of $\alpha$ and $\beta$ respectively. In
accordance with the Wright-Fisher rule, parents are chosen among all possible ones with equal
probability $1/N$ and, therefore, $g(\alpha)$ and $g(\beta)$ coincide with probability $1/N$.
In this case the distance $d(g(\alpha), g(\beta))$ vanishes. Otherwise, the parents of $\alpha$
and $\beta$ are distinct individuals $\alpha'$ and $\beta'$, which happens with probability
$(N-1)/N$. The above equation, when considered for all the $N(N-1)/2$ pairs, entirely
defines the dynamics of the population and simply states that the rescaled distance in the
new generation increases by $1/N$ with respect to the distance between the parents. Briefly,
$d(\alpha,\beta) = 1/N$ with probability $1/N$ and
$d(\alpha,\beta) = d(\alpha',\beta') + 1/N$ with probability $(N-1)/N$.

This equation can be iterated for any of the possible $N(N-1)/2$ initial pairs $\alpha$ and
$\beta$, which correspond to the entries of an upper triangular matrix. The iteration stops when
there is a coincidence of parents and in this way all the distances $d(\alpha,\beta)$ can be
calculated.
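The iteration of equation (1) can be illustrated with a short forward simulation. This is only a minimal sketch: the function names, the choice of starting from a matrix of zeros (identical founders) and the use of integer generation counts that are rescaled by $N$ at the end are assumptions made for the example, not part of the original model description.

```python
import random


def next_generation(gens, rng):
    """Apply Eq. (1) once; distances are stored as integer generation counts."""
    n = len(gens)
    # Wright-Fisher rule: every individual picks its parent uniformly at random.
    parent = [rng.randrange(n) for _ in range(n)]
    new = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if parent[a] == parent[b]:
                steps = 1  # common parent: the distance restarts at one generation
            else:
                steps = gens[parent[a]][parent[b]] + 1
            new[a][b] = new[b][a] = steps
    return new


def genealogical_distances(n, generations, seed=0):
    """Rescaled distances d = (generations to common ancestor) / N."""
    rng = random.Random(seed)
    gens = [[0] * n for _ in range(n)]  # assumed initial condition: zero distances
    for _ in range(generations):
        gens = next_generation(gens, rng)
    return [[g / n for g in row] for row in gens]
```

Running the iteration for a number of generations much larger than $N$ makes the result practically independent of the assumed initial condition, since all pairs have then met a common parent.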

In short, the iteration of equation (1) gives as output a realization of the $N(N-1)/2$ random
distances $d(\alpha,\beta)$, which are the entries of an upper triangular matrix containing all the
information necessary for the reconstruction of the genealogical tree of the population. The
tree is completely identified by its topology and by the time separation of all branching
events. There exist many methods that can be used for this reconstruction; a simple one
is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [17]. This algorithm works as
follows: it first identifies the two individuals with the shortest distance and puts their
branching at their time separation. Then it treats this pair as a new single object whose distance
from the other individuals is the average of the distances of its two components. Subsequently,
among the new group of objects it identifies the pair with the shortest distance, and so on.
At the end, one is left with only two objects which represent the two main branches, whose
distance gives the time position of the root of the tree. In this way the time from the last
common ancestor of all individuals in the population is fixed.

This method works for any kind of upper triangular matrix representing distances among
pairs of individuals, not necessarily originated by the coalescent process. In the coalescent
case, however, the method gives the correct tree, reproducing the historical branching
events and the correct time separations among them. Notice that at any step it chooses the
two individuals with the shortest distance. It is then easy to realize that, for the coalescent,
the distances of these two individuals from any third one are the same. Therefore, in this case,
all UPGMA averages are between pairs with identical distances, so that the resulting new
common distances are also the same.
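The procedure described above can be sketched in a few lines. The function name `upgma` and the returned merge history are illustrative choices. The sketch uses the cluster-size-weighted average of standard UPGMA, which coincides with the plain average of the two components whenever the two distances being averaged are equal, as in the coalescent case. Ties are resolved here by insertion order, which is an assumption of the example.

```python
def upgma(dist):
    """Return the UPGMA merge history as [(height, frozenset_of_leaves), ...].

    `dist` is a full symmetric matrix (list of lists); only the upper
    triangle is read. Each entry of the history records the distance at
    which a group was formed and the leaves it contains.
    """
    n = len(dist)
    members = {i: frozenset([i]) for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(members) > 1:
        i, j = min(d, key=d.get)  # closest pair of current objects
        height = d[(i, j)]
        left, right = members.pop(i), members.pop(j)
        updated = {}
        for k in members:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # average weighted by group size (standard UPGMA)
            updated[k] = (len(left) * d_ik + len(right) * d_jk) / (len(left) + len(right))
        # drop every distance involving the two merged objects
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        for k, v in updated.items():
            d[(k, next_id)] = v  # the new object always has the largest id
        members[next_id] = left | right
        merges.append((height, left | right))
        next_id += 1
    return merges
```

For a three leaves matrix with $d(0,1) = 0.2$ and $d(0,2) = d(1,2) = 1.0$ the history contains the clade $\{0,1\}$ at height 0.2 and the root at height 1.0.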

Genealogical trees are very complex objects and genealogical distances are distributed
according to a probability density which remains random in the limit of large population
[15, 16]. Anyway, the mathematical theory of the coalescent gives us the ability to deduce some
important information about their statistical structure. Consider a sample of $n$ individuals in
a population of size $N$, where $N$ is very large with respect to $n$. The probability that they
all have different parents in the previous generation is $\prod_{k=0}^{n-1} (1 - k/N)$.
Therefore, the probability that their ancestors are still all different at a past time $t$
corresponding to $tN$ generations is $\left[\prod_{k=0}^{n-1} (1 - k/N)\right]^{tN}$. If $N$ is
large compared to $n$ this quantity is approximately $\exp(-c_n t)$ where $c_n = n(n-1)/2$.
The genealogical tree results from this rule: the average probability density for
the time lag of the coalescent event for $n$ individuals is $p_n(t) = c_n \exp(-c_n t)$.
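The large $N$ approximation used above follows from a first-order expansion of the logarithm. The short derivation below is a worked step added for convenience; it assumes $n \ll N$, so that terms of order $k^2/N^2$ can be neglected.

```latex
\left[\prod_{k=0}^{n-1}\left(1-\frac{k}{N}\right)\right]^{tN}
  = \exp\!\left[\,tN\sum_{k=0}^{n-1}\ln\!\left(1-\frac{k}{N}\right)\right]
  \simeq \exp\!\left[-t\sum_{k=0}^{n-1}k\right]
  = \exp\!\left[-\frac{n(n-1)}{2}\,t\right]
  = e^{-c_n t}.
```

For $n = 2$ and $n = 3$ this gives the rates $c_2 = 1$ and $c_3 = 3$.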

Therefore, the tree starts at the root with the two main branches; then the following
branching event is at a random time lag $t_2$ with probability density $p(t_2) = e^{-t_2}$, and
after this time the tree has three branches (see Fig. 3). The next branching is after a time lag
$t_3$ with probability density $p(t_3) = 3e^{-3t_3}$ and then the tree has four branches, and so on.
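The sequence of time lags can be sampled directly from the exponential densities $p_n(t) = c_n \exp(-c_n t)$. The following is a minimal sketch; the function names and the dictionary layout (number of branches mapped to time lag) are assumptions made for the illustration.

```python
import random


def coalescence_rate(n):
    """Rate c_n = n(n-1)/2 of the time lag during which the tree has n branches."""
    return n * (n - 1) / 2.0


def sample_time_lags(n_leaves, rng):
    """One realization {k: t_k} of the time lags for k = 2, ..., n_leaves."""
    return {k: rng.expovariate(coalescence_rate(k)) for k in range(2, n_leaves + 1)}


def mean_time_lags(n_leaves, samples, seed=0):
    """Average each time lag over many independent realizations."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(2, n_leaves + 1)}
    for _ in range(samples):
        for k, t in sample_time_lags(n_leaves, rng).items():
            totals[k] += t
    return {k: total / samples for k, total in totals.items()}
```

The sample means approach $1/c_n$, that is 1 for $t_2$ and 1/3 for $t_3$.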

3. RANDOM COALESCENT PROCESS 

As already mentioned, in the coalescent model genealogical distances measure the time
from the last common ancestor of two individuals. Nevertheless, in almost all real situations,
we have to deal with genetic distances reconstructed from directly measurable empirical
quantities.

In the case of complex organisms, mtDNA is inherited only from the mother and therefore
undergoes haploid reproduction, so in this case genetic distances are proportional to the
number of mutations which have occurred in the compared mtDNA sequences (see, for example,
[9, 10]).

Analogously, languages can be considered as haploid individuals whose vocabulary changes
accumulate in time; in this case genetic distances can be evaluated by lexical distances
[12, 13, 14].

In both cases, an individual randomly accumulates mutations at a constant rate and the
genetic distance of a pair of individuals is the sum of the mutations that they accumulated
since their last common ancestor (rescaled by $N$). As a consequence, genetic distances
are proportional only on average to genealogical ones. Therefore, we have to modify the
deterministic equation (1) in order to take into account this randomness. We may assume
that increments in the genetic distance have the simple form

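$$h(\alpha,\beta) = h(g(\alpha), g(\beta)) + \gamma_\alpha + \gamma_\beta \qquad (2)$$

where $g(\alpha)$ and $g(\beta)$ are the parents of $\alpha$ and $\beta$ respectively, while
$\gamma_\alpha$ and $\gamma_\beta$ are random variables associated with the mutations of $\alpha$
and $\beta$. They are zero if the genome of the parent is transmitted unaltered and a positive
constant if a mutation occurs. We assume that the probability is $1 - \frac{\mu}{2N}$ for zero
and $\frac{\mu}{2N}$ for the positive constant $\frac{1}{\mu}$. In a compact form:

$$\gamma = \begin{cases} 0 & \text{with probability } 1 - \frac{\mu}{2N}, \\ \frac{1}{\mu} & \text{with probability } \frac{\mu}{2N}. \end{cases} \qquad (3)$$

This rule guarantees that genetic distances are equal to genealogical distances on average; in
fact, the expected value of the sum $\gamma_\alpha + \gamma_\beta$ is $1/N$.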

Notice that we compare genealogical distances generated by (1) with genetic ones generated
by (2). Since they describe two aspects of the same population, the family history
must be the same. This means that the realization of the part of the process which assigns
parents in (1) and in (2) must also be the same ($\alpha \to g(\alpha)$); the only difference lies
in the deterministic/random nature of the distance increment. In other words, the parents of
two individuals $\alpha$ and $\beta$ are unequivocally assigned. Then, their genealogical distance
is $d(\alpha,\beta) = d(g(\alpha), g(\beta)) + 1/N$, while the genetic one is
$h(\alpha,\beta) = h(g(\alpha), g(\beta)) + \gamma_\alpha + \gamma_\beta$,
where $\gamma_\alpha$ and $\gamma_\beta$ are the previously defined random variables.
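The requirement that both sets of distances share the same family history can be made concrete with a joint simulation. The sketch below follows the ancestral lineages backwards in time, which is an implementation choice made for this example, and the names `simulate_distances`, `muts` and `clusters` are hypothetical. Each transmission from parent to child carries a mutation with probability $\mu/(2N)$, so $\mu \le 2N$ is assumed.

```python
import random


def simulate_distances(n, mu, seed=0):
    """Return (d, h): genealogical and genetic distance matrices built on
    the same Wright-Fisher family history (requires mu <= 2 * n)."""
    rng = random.Random(seed)
    p_mut = mu / (2.0 * n)
    d = [[0.0] * n for _ in range(n)]
    h = [[0.0] * n for _ in range(n)]
    muts = [0] * n                      # mutations along each ancestral line
    clusters = [[i] for i in range(n)]  # present-day individuals per lineage
    gen = 0
    while len(clusters) > 1:
        gen += 1
        by_parent = {}
        for group in clusters:
            if rng.random() < p_mut:    # mutation on this parent-child step
                for a in group:
                    muts[a] += 1
            by_parent.setdefault(rng.randrange(n), []).append(group)
        clusters = []
        for groups in by_parent.values():
            merged = []
            for group in groups:
                # lineages meeting in a common parent fix their distances now
                for a in merged:
                    for b in group:
                        d[a][b] = d[b][a] = gen / n
                        h[a][b] = h[b][a] = (muts[a] + muts[b]) / mu
                merged.extend(group)
            clusters.append(merged)
    return d, h
```

With $\mu = 2N$ every transmission carries a mutation and the two matrices coincide, while for smaller $\mu$ the entries of `h` are multiples of $1/\mu$ scattered around those of `d`.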

In order to have a qualitative idea of the differences between the set of genetic distances
and the set of genealogical ones, we plot in Fig. 1 the frequency of the distances of the
two sets. We have used the same realization of the process for 700 individuals and we
have chosen $\mu = 50$ for the genetic distances. We can see that while genealogical distances
may assume only a few values, where their distribution has spikes [15, 16], the genetic ones
are dispersed around them. The dispersion decreases when $\mu$ increases, and when $\mu = 2N$
genetic distances lose their randomness in mutations and they equal, for any realization,
the associated genealogical ones (compare (1) with (2) and (3)). In the case of a very large
population ($N \to \infty$) the coincidence of the two sets of distances is recovered in the
$\mu \to \infty$ limit.



4. WRONG TREE RECONSTRUCTION: TOPOLOGY 

The problem, in most practical cases, is that genealogical distances are unknown and one
would like to reconstruct the genealogical tree of a population from the measured genetic
distances. This is the case in biology, where strands of mtDNA are compared, as well as in
lexicostatistics, where vocabularies or grammar structures play the same role as mtDNA.

As mentioned in the previous section, when $\mu = 2N$ equations (1) and (2) coincide and
randomness in mutations is lost. In this limiting case genetic and genealogical distances
are equal and not only are the frequency distributions identical, but the family trees
reconstructed by UPGMA will also be exactly the same. For smaller values of $\mu$, we expect
that the fidelity level of the reconstruction of a tree decreases. We would therefore like to
have quantitative information on the difference between the trees reconstructed from the matrices
of genealogical and genetic distances.

FIG. 1: Frequencies of the genealogical and genetic distances ($\mu = 50$) in a population of 700
individuals. The realization of the genealogical process (the family history) is the same for the
two distributions.

A qualitative understanding of the problem is immediate from Fig. 2, where four trees of
twenty leaves are reconstructed. The first and correct one, with label A, is associated with the
genealogical distances and the remaining three with the genetic ones for three different values
of $\mu$. The realization of the genealogical process (the family history) is the same for the four
pictures. One can see that the quality of the reconstruction decreases for smaller $\mu$. In fact,
the tree with label D, which corresponds to $\mu = 100$, is topologically quite correct, with a
couple of wrong clades (see leaves G, B, T, C, A and M, U, E) at the lower level. Also the
separation times of the branches are not so different from the correct ones. For $\mu = 50$ with
label C and $\mu = 10$ with label B the quality of the reproduction decreases: clades are wrong
also at a higher level and time depths are quite different from the right ones.

FIG. 2: Four trees of twenty leaves, corresponding to a sample of 20 individuals in a population of
500. The first tree is reconstructed from the genealogical distances and the others from the genetic
ones ($\mu = 10$ for B, $\mu = 50$ for C and $\mu = 100$ for D). The quality of the reproductions
of the first tree (the correct one) by the others is lower for smaller $\mu$, both for what concerns
topology and time depths.



We start the quantitative study of the quality of reconstruction by considering the simplest
situation, a tree with three leaves. The topology of a three leaves tree is completely
determined by the pair of individuals that are joined first because their distance is the
smallest. Consequently, the genealogical and the genetic trees reconstructed by UPGMA
will have the same topology if the same pair of individuals has both the smallest genetic
and the smallest genealogical distance. Let us call $\alpha$, $\beta$ and $\gamma$ the three
individuals, and assume that $\alpha$ and $\beta$ are the pair with the smallest genealogical
distance $d(\alpha,\beta)$. By the argument in Section 2 we know that $d(\alpha,\beta) = t_3$ and
$d(\alpha,\gamma) = d(\beta,\gamma) = t_2 + t_3$, where $t_2$ and $t_3$ are
independent exponentially distributed variables with average 1 and 1/3 respectively. Then
let us consider the following three events concerning genetic distances. The first, which we
call A, is

$$h(\alpha,\beta) < \min\{h(\alpha,\gamma);\, h(\beta,\gamma)\} \qquad (4)$$

If A is satisfied, the topology of the genetic tree reconstructed by UPGMA is the correct
one since it is the same as that of the genealogical tree. The second, which we call B, is

$$h(\alpha,\beta) = \min\{h(\alpha,\gamma);\, h(\beta,\gamma)\}, \qquad h(\alpha,\gamma) \neq h(\beta,\gamma) \qquad (5)$$

which corresponds to an ambiguous (but unlikely) situation for UPGMA, which will be able
to reconstruct the tree correctly with probability 1/2. The third, which we call C, is

$$h(\alpha,\beta) = h(\alpha,\gamma) = h(\beta,\gamma) \qquad (6)$$

which is also ambiguous (and even more unlikely). In this case, UPGMA will be able to
reconstruct the tree correctly with probability 1/3.

Let us now call $P(A \mid t_2, t_3)$ the probability of the event A given the realized values
$t_2$ and $t_3$, and $P(B \mid t_2, t_3)$ and $P(C \mid t_2, t_3)$ the equivalent conditional
probabilities for the events B and C respectively. Let us also call $P(W \mid t_2, t_3)$ the
probability of a wrong reconstruction of the tree corresponding to $t_2$ and $t_3$. Since B and C
lead to a wrong tree with probability 1/2 and 2/3 respectively, we have

$$P(W \mid t_2, t_3) = 1 - P(A \mid t_2, t_3) - \frac{1}{2} P(B \mid t_2, t_3) - \frac{1}{3} P(C \mid t_2, t_3) \qquad (7)$$

Now we call $n(\alpha)$ the number of mutations along the branch $\alpha$ divided by $\mu$, as
shown in Fig. 3; analogously we define $n(\beta)$, $n(\gamma)$ and $n(\alpha\beta)$ as the numbers
of mutations divided by $\mu$ along the branches indicated in Fig. 3.

We will have $h(\alpha,\beta) = n(\alpha) + n(\beta)$,
$h(\alpha,\gamma) = n(\alpha) + n(\alpha\beta) + n(\gamma)$ and
$h(\beta,\gamma) = n(\beta) + n(\alpha\beta) + n(\gamma)$. The advantage is that the four new
variables are independent and can be obtained as sums of variables of type (3), where the number
of terms in each sum is $N$ times the time lag of the associated branch, namely $t_3$ for
$n(\alpha)$ and $n(\beta)$, $t_2$ for $n(\alpha\beta)$ and $t_2 + t_3$ for $n(\gamma)$.

FIG. 3: Outline of a three leaves tree. $n(\alpha)$, $n(\beta)$, $n(\gamma)$ and $n(\alpha\beta)$
are the numbers of mutations divided by $\mu$.



Given this construction we can straightforwardly, if laboriously, compute (see the Appendix) the
conditional probability $P(W \mid t_2, t_3)$ and, then, the absolute probability of a wrong tree
$P(W)$ as the marginal of the joint probability $P(W \mid t_2, t_3)\, p(t_2)\, p(t_3)$, where
$p(t_2)$ and $p(t_3)$ are the exponential densities previously described. The probability of a
wrong tree $P(W)$ is plotted in Fig. 4 with respect to the parameter $\mu$ in the case of three
individuals in a large population ($N \to \infty$).
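As a cross-check of the exact computation, $P(W)$ can also be estimated by direct sampling. The sketch below is a Monte Carlo illustration, not the method used for Fig. 4. It assumes the large $N$ limit, in which the number of mutations on a branch of rescaled length $t$ is Poisson distributed with mean $\mu t/2$, and it weights the ambiguous events B and C by their probabilities of a wrong choice, 1/2 and 2/3. The helper names are hypothetical.

```python
import random


def poisson(lam, rng):
    """Poisson sample obtained by counting unit-rate arrivals before time lam."""
    k = 0
    s = rng.expovariate(1.0)
    while s < lam:
        k += 1
        s += rng.expovariate(1.0)
    return k


def wrong_topology_probability(mu, trials, seed=0):
    """Monte Carlo estimate of P(W) for a three leaves tree (large N limit)."""
    rng = random.Random(seed)
    wrong = 0.0
    for _ in range(trials):
        t2 = rng.expovariate(1.0)   # density exp(-t2)
        t3 = rng.expovariate(3.0)   # density 3 exp(-3 t3)
        k_a = poisson(mu * t3 / 2.0, rng)              # branch alpha
        k_b = poisson(mu * t3 / 2.0, rng)              # branch beta
        k_3 = poisson(mu * (2.0 * t2 + t3) / 2.0, rng)  # branches alpha-beta and gamma
        top = max(k_a, k_b)
        if top < k_3:
            continue              # event A: correct topology
        if top > k_3:
            wrong += 1.0          # a wrong pair is strictly closer
        elif k_a != k_b:
            wrong += 0.5          # event B: wrong with probability 1/2
        else:
            wrong += 2.0 / 3.0    # event C: wrong with probability 2/3
    return wrong / trials
```

For very small $\mu$ almost no mutation occurs, all genetic distances vanish, and the estimate tends to 2/3; the estimate decreases as $\mu$ grows.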

If we take into account more than three individuals the situation immediately becomes
more complicated, since the number of possible tree topologies increases exponentially with the
number of leaves. So we need to introduce a measure of the difference between the genealogical tree
and an associated genetic one. The simplest tree distance measure is the Robinson-Foulds
Symmetric Difference [11], which only depends on the topology of the two trees and not on
the differences in branch lengths.

The Symmetric Difference (SD) is computed by considering all possible branches that
may exist in the two trees. Each inner branch, i.e. a branch connecting two nodes or one
node to the root, identifies a clade in the set of leaves. The resulting distance is simply the
number of clades present in one of the considered trees but not in the other. Therefore, two
identical trees have zero SD, but it is sufficient to exchange two leaves on one of them to
have a non-zero SD.

FIG. 4: Expected Symmetric Difference between a genealogical tree and the associated genetic tree
plotted with respect to $\mu$. The full line corresponds to the probability of a wrong topology
$P(W)$ for a three leaves tree while crosses correspond to the ESD of a 20 leaves tree, estimated
numerically. In the first case, computed exactly, the ESD is twice the probability of a wrong
topology.

In general the SD does not have an immediate statistical interpretation, i.e. we cannot say whether
a larger distance is significantly larger than a smaller one. Anyway, in the particular case
of trees with only three leaves the expected symmetric distance is twice the wrong topology
probability $P(W)$. In fact, in a three leaves tree there is only one clade and the Symmetric
Distance is equal to 0 in the case of correct topology (if both trees have the same clade) and
is equal to 2 in the case of a wrong one (if the clades are different). Consequently, in this
simple case, the expected SD (which we call ESD) is given by the relation

$$\mathrm{ESD} = 2 \cdot P(W) + 0 \cdot (1 - P(W)) = 2 \cdot P(W).$$
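The clade counting behind the SD can be written compactly. In this sketch trees are represented as nested tuples of leaf labels, which is an assumed encoding chosen for the example, and the root (the set of all leaves) is not counted as a clade, so that a three leaves tree has exactly one clade as stated above.

```python
def clades(tree):
    """Set of clades (as frozensets of leaves) of a nested-tuple tree.

    Single leaves and the root are excluded, so only inner branches count.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        group = frozenset().union(*(leaves(child) for child in node))
        found.add(group)
        return group

    root = leaves(tree)
    found.discard(root)
    return found


def symmetric_difference(tree_a, tree_b):
    """Number of clades present in one tree but not in the other."""
    return len(clades(tree_a) ^ clades(tree_b))
```

For example, `symmetric_difference((("a", "b"), "c"), (("a", "c"), "b"))` counts the clade {a, b} of the first tree and the clade {a, c} of the second, giving the value 2 used in the ESD relation.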

In order to compute numerically the ESD between a genealogical tree and the associated
genetic one with parameter $\mu$ we use the following procedure: we take 20 individuals in
a population of 500 (a large one) and we use UPGMA to reconstruct their genealogical
tree from a realization of the genealogical distances matrix. Then, we construct several
associated genetic trees (5 for $\mu < 15$, 10 for greater values) and we compute their averaged
SD with respect to the associated genealogical tree. We start again with a new realization
of the matrix of genealogical distances and we repeat the procedure, ending with a new
averaged SD. We do this many times (from 6 for $\mu = 5$ to 30 for $\mu = 100$) and, finally, we
take the mean of all averaged SDs, ending with a quantity that should be very close to the
ESD. The number of genealogical trees and that of the associated genetic trees that we use
for estimating the ESD increase with $\mu$ since we observe an increasing fluctuation in the
SD values.

In Fig. 4 we plot the estimated ESD of the 20 leaves tree. We also plot the exactly
computed $P(W)$ of a three leaves tree, which is one half of its ESD. We find, unexpectedly,
that they only differ by a factor due to the total number of clades, which depends on the
number of leaves.

5. WRONG TREE RECONSTRUCTION: BRANCHES LENGTH 

Now we study the problem of the differences among branch lengths in genealogical and
genetic trees. We first derive the probability of the error between the genetic and genealogical
distances of two individuals, and then we sketch the computation of the errors for
the distances in a tree of three individuals. We will consider only the cases in which the genetic
and genealogical trees have the same topology. In this way the integral of the probability of
the error (errors for the three leaves case) will be equal to the total probability of having
the right topology, $P(R) = 1 - P(W)$. Obviously, the right topology condition is always
satisfied for two individuals ($P(R) = 1$).

In a tree of two individuals we have only a single genealogical distance to be compared
with a single genetic one. Let us call $t$ the genealogical distance and $h$ the genetic one, that
is the number of mutations on both the genealogical branches divided by $\mu$. We call
$k = \mu h$ the total number of mutations on these two branches; then $t$ and $h$ will be equal
only if $k = \mu t$. By using equation (3) we can write the expression of the conditional
probability of having $k$ mutations along the two genealogical branches, each of length $tN$
and, therefore, of total length $2tN$:

$$P_\mu(k \mid 2tN) = \binom{2tN}{k} \left(\frac{\mu}{2N}\right)^{k} \left(1 - \frac{\mu}{2N}\right)^{2tN-k} \simeq \frac{e^{-\mu t} (\mu t)^{k}}{k!} \qquad (8)$$

where the approximation holds for large $N$.

One easily gets the conditional averages $\langle k \rangle = \mu t$ and
$\langle (k - \langle k \rangle)^2 \rangle = \mu t$. It is then
straightforward to define the error between genetic and genealogical distance as
$e = (h - t)/\sqrt{t}$; in this way, in fact, one gets the conditional (with respect to $t$)
averages $\langle e \rangle = 0$ and $\langle e^2 \rangle = 1/\mu$, which do not depend on $t$.
As a consequence of this independence the absolute averages of $e$ and $e^2$ coincide with the
conditional ones. The conclusion is that the typical error in evaluating the distance from the
common ancestor is proportional to $1/\sqrt{\mu}$.
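The two conditional moments can be checked numerically at fixed $t$. The sketch below assumes the large $N$ (Poisson) form of the mutation count, with mean $\mu t$ on the two branches, and defines the error as $e = (k/\mu - t)/\sqrt{t}$; the function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Poisson sample obtained by counting unit-rate arrivals before time lam."""
    k = 0
    s = rng.expovariate(1.0)
    while s < lam:
        k += 1
        s += rng.expovariate(1.0)
    return k


def error_moments(mu, t, samples, seed=0):
    """Sample estimates of <e> and <e^2> at fixed genealogical distance t."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(samples):
        k = poisson(mu * t, rng)            # mutations on both branches
        e = (k / mu - t) / math.sqrt(t)     # rescaled error
        total += e
        total_sq += e * e
    return total / samples, total_sq / samples
```

Repeating the estimate for different values of $t$ at the same $\mu$ gives a first moment close to zero and a second moment close to $1/\mu$ in each case.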

The independence of the two conditional averages from $t$ does not imply that the
conditional probability density for $e$ is itself independent of $t$. In fact, since one has
$h = k/\mu$ and, therefore, $e = (k/\mu - t)/\sqrt{t}$, the conditional probability density of the
error turns out to be

$$p_\mu(e \mid t) = \sum_{k=0}^{\infty} \frac{e^{-\mu t} (\mu t)^{k}}{k!}\, \delta\!\left(e - \frac{k/\mu - t}{\sqrt{t}}\right) \qquad (9)$$

where the $\delta(\cdot)$ are Dirac delta functions. This conditional density for $e$ given a
genealogical distance $t$, at variance with its first two moments, clearly depends on $t$.

Finally, the absolute probability density $p_\mu(e)$ can be calculated as the marginal of the
joint probability density $p_\mu(e \mid t)\, p(t)$, where $p(t) = \exp(-t)$ is the density of the
genealogical distances. Integrating the delta functions in (9) over $t$ we obtain

$$p_\mu(e) = \sum_{k=0}^{\infty} \frac{(\mu t_k)^{k}}{k!}\, e^{-(1+\mu) t_k}\, \frac{2\, t_k^{3/2}}{t_k + k/\mu} \qquad (10)$$

where $t_k$ is the solution of $e = (k/\mu - t)/\sqrt{t}$, namely

$$t_k = \frac{1}{4} \left[ \sqrt{e^2 + \frac{4k}{\mu}} - e \right]^2 \qquad (11)$$

In Fig. 5 $p_\mu(e)$ is plotted for some values of $\mu$. In the limit $\mu \to \infty$ the
distribution becomes a Dirac delta function centered at zero, in accordance with the fact that the
standard deviation goes to zero as $1/\sqrt{\mu}$.

FIG. 5: The probability distribution of the error $p_\mu(e)$ is plotted for $\mu = 10$, $\mu = 20$,
$\mu = 50$, $\mu = 100$ and $\mu = 200$. The curves become sharper for increasing values of $\mu$.

Now, let us come back to the case of a genealogical tree with three leaves. The study of
this kind of structure, even if much simpler than the genealogical tree of an entire population,
gives us important information since it concerns the reconstruction of the top of the tree. Indeed,
to reconstruct correctly this part of the genealogical tree, that is the part going from
the founder to the three most recent ancestors of the entire population, means to correctly
identify the three main sub-populations and their separation times.

We have computed the probability of obtaining the correct topology for a three leaves tree from
the genetic matrix by using UPGMA. In the following, we give a sketch of the derivation
of the error distribution of its two characteristic distances. Hereafter we restrict our analysis
to the situation in which the topology of the genetic tree is the same as that
of the genealogical one. We will refer to the scheme in Fig. 3 and to the notation used in
Section 4.

The genetic tree can be characterized by two distances: $h = h(\alpha,\beta)$, separating the
individuals with minimal genealogical distance $t_3$, and
$H = [h(\alpha,\gamma) + h(\beta,\gamma)]/2$, calculated
by UPGMA as the mean value of the major distances and therefore separating the third
individual from the others. The distance $H$ does not correspond to the maximum genetic
distance but is the mean value of the two major distances and gives a better estimate of
the coalescence time $T = t_3 + t_2$ of the three individuals to the common ancestor.

We have seen that on average $h$ coincides with $t_3$ and $H$ on average is equal to $T$ but, in
general, the genetic distances will be different from the genealogical ones and the branches
of the tree reconstructed from the genetic matrix by UPGMA will not have the same length
as those of the genealogical tree.

Here we define, in analogy with the two leaves case, the errors $e_1$ and $e_2$ of $h$ and $H$ by
the equations:

$$e_1 = \frac{h - t_3}{\sqrt{t_3}}, \qquad e_2 = \frac{H - T}{\sqrt{T}} \qquad (12)$$

The variables $e_1$ and $e_2$ vanish when the genetic distances equal the genealogical ones.

In order to compute the probability density of the errors we use the relations introduced
in the previous section and we rewrite the relations (12) in the form

$$e_1 = \frac{n(\alpha) + n(\beta) - t_3}{\sqrt{t_3}}, \qquad e_2 = \frac{\frac{1}{2}[n(\alpha) + n(\beta)] + n(\gamma) + n(\alpha\beta) - t_2 - t_3}{\sqrt{t_2 + t_3}} \qquad (13)$$

Since $n(\alpha)$, $n(\beta)$ and $n(\gamma) + n(\alpha\beta)$ are independent variables which we
described in the previous section, it is straightforward but laborious to compute the conditional
probability $p_\mu = p_\mu(e_1, e_2 \mid t_2, t_3)$.

The sketch goes as follows: first we write the joint conditional probability for the
independent variables $n(\alpha)$, $n(\beta)$ and $n(\gamma) + n(\alpha\beta)$ as

$$p_\mu(n(\alpha) \mid t_3)\; p_\mu(n(\beta) \mid t_3)\; p_\mu(n(\gamma) + n(\alpha\beta) \mid 2t_2 + t_3) \qquad (14)$$

where the explicit expression for the probabilities $p_\mu(n \mid t)$ is shown in the Appendix (see
(22)).

Then we are able to compute $p_\mu = p_\mu(e_1, e_2 \mid t_2, t_3)$ as the sum
$p_\mu = p_\mu(A) + \frac{1}{2} p_\mu(B) + \frac{1}{3} p_\mu(C)$,
where $p_\mu(A)$ is obtained by summing the conditional probability (14) over all the triplets
$n(\alpha)$, $n(\beta)$ and $n(\gamma) + n(\alpha\beta)$ which satisfy condition A of section 4 and
relations (13). Analogously, we can compute $p_\mu(B)$ and $p_\mu(C)$. Once we have
$p_\mu(e_1, e_2 \mid t_2, t_3)$ we can compute the joint probability

$$p_\mu(e_1, e_2, t_2, t_3) = p_\mu(e_1, e_2 \mid t_2, t_3)\; 3 e^{-(3t_3 + t_2)} \qquad (15)$$

and after integration over $t_2$ and $t_3$ we obtain the joint density $p_\mu(e_1, e_2)$. The
normalization of this density equals the probability $P(R)$ of correct topology identification by
UPGMA.

More complex trees could be considered in principle, both for what concerns topology and
branch lengths; nevertheless, the number of calculations increases exponentially.



6. CONCLUSIONS AND OUTLOOK 



The intrinsic randomness of genetic mutations is an obstacle to a safe reconstruction of a
genealogical tree. The smaller the probability of a mutation, the more probable a wrong
reconstruction is. This is a serious problem since in many cases in biology the distances are
measured by a molecular clock which is obtained by comparing short strands of DNA which
slowly accumulate errors along reproduction events (see, for example, [9, 10]).

We are able to quantify, in simple but relevant cases, the probability of a wrong
reconstruction of a tree, both for what concerns the topology and the proportions. We can, for
example, give the error concerning the time separation of two species using the results in section
5, and we are also able to determine the probability of a wrong reconstruction for a family tree
of three species using the results in section 4.

We plan to continue this investigation in order to better quantify the difference between
the genetic and the genealogical matrices of distances. We think, for example, of the
possibility of introducing a measure of distance between the associated probabilities of distances.
Even more important is to find a method to estimate the value of the parameter $\mu$ given
an empirical distribution of genetic distances. This would be useful for situations in which
the value of $\mu$ is not known a priori as, for example, for the languages in the Indo-European
and Austronesian groups [12, 13, 14]. This method would allow us to use the results of
this paper for evaluating the fidelity level of phylogenetic trees reconstructed from empirical
lexical distances.
APPENDIX: WRONG TOPOLOGY PROBABILITY P(W)

In this appendix we will use the notation introduced in Section 4 and shown in Fig. 3.
We have that the probability $P(R)$ of obtaining the correct genealogical tree from the genetic
distances matrix is:

$$P(R) = P(A) + \frac{1}{2} P(B) + \frac{1}{3} P(C). \qquad (16)$$

$P(A)$ is the probability of event A, i.e., the probability of having genetic distances among
individuals $\alpha$, $\beta$ and $\gamma$ satisfying the inequality (4). Using the independent
variables, (4) rewrites as

$$\max\{n(\alpha);\, n(\beta)\} < n(\alpha\beta) + n(\gamma); \qquad (17)$$

$P(B)$ is the probability of event B, while $P(C)$ is that of event C. The events B and
C occur respectively if the genetic distances satisfy the conditions (5) and (6), which rewrite
respectively as

$$\max\{n(\alpha);\, n(\beta)\} = n(\alpha\beta) + n(\gamma), \qquad n(\alpha) \neq n(\beta) \qquad (18)$$

and

$$n(\alpha) = n(\beta) = n(\alpha\beta) + n(\gamma). \qquad (19)$$

Let us define $n_1$ as the maximum between $n(\alpha)$ and $n(\beta)$, $n_2$ as the minimum
($n_1 \geq n_2$), and $n_3 = n(\alpha\beta) + n(\gamma)$. Then, the probability $P(A)$ corresponds
to the probability of having $n_1 \geq n_2$ and $n_1 < n_3$. Therefore, we have to determine the
total probability for the set of triplets $\{n_1; n_2; n_3\}$, where the variables $n_i$ can take
only values that are multiples of $1/\mu$ and have to satisfy the conditions
$n_1 \in [0, \infty)$, $n_2 \in [0, n_1]$ and $n_3 \in (n_1, \infty)$.

If we consider separately the cases $n_1 > n_2$ and $n_1 = n_2$ we immediately obtain:

$$P(A) = 2 \sum_{n_1 = 1/\mu}^{\infty} \; \sum_{n_2 = 0}^{n_1 - 1/\mu} \; \sum_{n_3 = n_1 + 1/\mu}^{\infty} p_\mu(n_1, n_2, n_3) + \sum_{n_1 = 0}^{\infty} \; \sum_{n_3 = n_1 + 1/\mu}^{\infty} p_\mu(n_1, n_1, n_3), \qquad (20)$$

where $p_\mu(n_1, n_2, n_3)$ is the joint probability of having $n_1 \mu$, $n_2 \mu$ and $n_3 \mu$
mutations respectively on the branches $(\alpha)$, $(\beta)$ and $((\alpha\beta) \cup (\gamma))$
of the tree in Fig. 3. The factor 2 is an exchange factor which takes into account the possibility
of having $n_1 = n(\alpha)$ or equivalently $n_1 = n(\beta)$ if $n(\alpha) \neq n(\beta)$.

Since the three events are independent, using the exponential distributions of the coales- 
cent times t n , i.e. of the length of the branches of the genealogical tree, we have: 

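$$p_\mu(n_1, n_2, n_3) = 2 \int_0^\infty \int_0^\infty p_\mu(n_1 \mid t_3)\, p_\mu(n_2 \mid t_3)\, p_\mu(n_3 \mid 2t_2 + t_3)\; 3 e^{-(3t_3 + t_2)}\, dt_2\, dt_3, \qquad (21)$$

where the factor 2 comes from the Jacobian of the transformation $t' = t'(t_2, t_3) = 2t_2 + t_3$.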

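The conditional probabilities in the integral are all of the form $p_\mu(n \mid t)$ as given by the
expression

$$p_\mu(n \mid t) = \binom{tN}{\mu n} \left(\frac{\mu}{2N}\right)^{\mu n} \left(1 - \frac{\mu}{2N}\right)^{tN - \mu n} \simeq \frac{e^{-\mu t/2} (\mu t/2)^{\mu n}}{(\mu n)!} \qquad (22)$$

which is easily derivable from equation (8); the approximation holds for large
populations ($N \to \infty$).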

In the same way one has that the probability $P(B)$ of event B is the probability of having
$n_1 > n_2$ and $n_3 = n_1$. Then we obtain:

$$P(B) = 2 \sum_{n_1 = 1/\mu}^{\infty} \; \sum_{n_2 = 0}^{n_1 - 1/\mu} p_\mu(n_1, n_2, n_1) \qquad (23)$$

where the factor 2 is also an exchange factor.

Finally, for the probability $P(C)$ of event C, that is the probability of having
$n_1 = n_2 = n_3$, we can write:

$$P(C) = \sum_{n_1 = 0}^{\infty} p_\mu(n_1, n_1, n_1), \qquad (24)$$

where, of course, there is no exchange factor.

Putting together all the different terms, the resulting expression is:

$$P(R) = 3 \int_0^\infty \int_0^\infty P(R \mid t_2, t_3)\, e^{-(3t_3 + t_2)}\, dt_2\, dt_3, \qquad (25)$$

and

$$\begin{aligned}
P(R \mid t_2, t_3) = {} & 2 \sum_{n_1 = 1/\mu}^{\infty} \; \sum_{n_2 = 0}^{n_1 - 1/\mu} \; \sum_{n_3 = n_1 + 1/\mu}^{\infty} p_\mu(n_1 \mid t_3)\, p_\mu(n_2 \mid t_3)\, p_\mu(n_3 \mid t_3 + 2t_2) \\
& + \sum_{n_1 = 0}^{\infty} \; \sum_{n_3 = n_1 + 1/\mu}^{\infty} p_\mu^2(n_1 \mid t_3)\, p_\mu(n_3 \mid t_3 + 2t_2) \\
& + \sum_{n_1 = 1/\mu}^{\infty} \; \sum_{n_2 = 0}^{n_1 - 1/\mu} p_\mu(n_1 \mid t_3)\, p_\mu(n_2 \mid t_3)\, p_\mu(n_1 \mid t_3 + 2t_2) \\
& + \frac{1}{3} \sum_{n_1 = 0}^{\infty} p_\mu^2(n_1 \mid t_3)\, p_\mu(n_1 \mid t_3 + 2t_2). \qquad (26)
\end{aligned}$$

Finally, the probability of a wrong reconstruction is $P(W) = 1 - P(R)$.
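The conditional probability $P(R \mid t_2, t_3)$ can be evaluated numerically by truncating the sums. The sketch below assumes the large $N$ (Poisson) limit of the branch probabilities, with mean $\mu t_3/2$ for the branches $(\alpha)$ and $(\beta)$ and mean $\mu(2t_2 + t_3)/2$ for $(\alpha\beta) \cup (\gamma)$, and it works with integer mutation counts instead of multiples of $1/\mu$. The truncation rule and the function names are choices made for this illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam > 0 (computed in logs)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def right_topology_given_times(mu, t2, t3):
    """P(R | t2, t3) = P(A) + P(B)/2 + P(C)/3 at fixed time lags."""
    lam_leaf = mu * t3 / 2.0               # branches (alpha) and (beta)
    lam_rest = mu * (2.0 * t2 + t3) / 2.0  # branches (alpha beta) and (gamma)
    top = max(lam_leaf, lam_rest)
    kmax = int(top + 10.0 * math.sqrt(top) + 20)  # assumed truncation point
    total = 0.0
    below = 0.0      # P(count on one leaf branch < k1)
    cum_rest = 0.0   # P(count on the remaining branches <= k1)
    for k1 in range(kmax + 1):
        p_leaf = poisson_pmf(k1, lam_leaf)
        p_rest = poisson_pmf(k1, lam_rest)
        cum_rest += p_rest
        tail = max(0.0, 1.0 - cum_rest)    # P(n3 count > k1), event A
        unequal = 2.0 * p_leaf * below     # maximum is k1, other branch smaller
        equal = p_leaf * p_leaf            # both leaf branches carry k1
        total += unequal * (tail + p_rest / 2.0) + equal * (tail + p_rest / 3.0)
        below += p_leaf
    return total
```

For vanishing $\mu$ the result tends to 1/3, since all distances are then equal and only event C occurs, while for large $\mu$ and well separated branching times it approaches 1. Integrating the result over $t_2$ and $t_3$ with weight $3e^{-(3t_3 + t_2)}$ would give $P(R)$.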